Abstract

A new type of log, the command log, is being employed to replace the
traditional data log (e.g., the ARIES log) in in-memory databases.
Instead of recording how the tuples are updated, a command log only
tracks the transactions being executed, thereby effectively reducing
the size of the log and improving performance. Command logging, on
the other hand, increases the cost of recovery, because all the
transactions in the log after the last checkpoint must be completely
redone in case of a failure. In this paper, we first extend the
command logging technique to a distributed environment, where all the
nodes can perform recovery in parallel. We then propose an adaptive
logging approach that combines data logging and command logging. The
percentage of data logging versus command logging becomes an
optimization between the performance of transaction processing and
recovery to suit different OLTP applications. Our experimental study
compares the performance of our proposed adaptive logging,
ARIES-style data logging and command logging on top of H-Store. The
results show that adaptive logging can achieve a 10x boost for
recovery and a transaction throughput that is comparable to that of
command logging.

1. Introduction 

Harizopoulos et al. [9] show that in in-memory databases, a
substantial amount of time is spent in logging, latching, locking,
index maintenance, and buffer management. The existing techniques in
relational databases lead to suboptimal performance for in-memory
databases, because the assumption that I/O is the main bottleneck is
no longer valid. For instance, in conventional databases, the most
widely used logging approach is the write-ahead log (e.g., the ARIES
log). Write-ahead logging records the history of transactional
updates to the data tuples, and we shall refer to it as the data log
in this paper.

Figure 1: Example of logging techniques

Consider the example shown in Figure 1. There are two nodes
processing four concurrent transactions, $T_1$ to $T_4$. All the
transactions follow the same format:

$f(x, y): y = 2x$

So $T_1 = f(A, B)$, indicating that $T_1$ reads the value of $A$ and
then updates $B$ to $2A$. Since different transactions may modify the
same value, a locking mechanism is required. Based on their
timestamps, the correct serialized order of the transactions is
$T_1$, $T_2$, $T_3$ and $T_4$. Let $v(X)$ denote the value of
parameter $X$. The ARIES data log records of the four transactions
are listed below:

Table 1: ARIES log

timestamp   transaction ID   parameter   old value   new value
100001      T1               B           v(B)        2v(A)
100002      T2               G           v(G)        2v(C)
100003      T3               B           v(B)        2v(D)
100004      T4               D           v(D)        2v(G)

The ARIES log records how the data are modified by the transactions,
and by using the log data, we can efficiently recover if there is a
node failure. However, the recovery process of in-memory databases is
slightly different from that of conventional disk-based databases. To
recover, an in-memory database first loads the database snapshot
recorded in the last checkpoint and then replays all the committed
transactions in the ARIES log. For uncommitted transactions, no
roll-backs are required, since uncommitted writes are never reflected
onto disk.

ARIES logging is a “heavy-weight” logging approach, as it incurs high
overheads. In conventional database systems, where the I/Os for
processing transactions dominate the performance, the logging cost is
tolerable. However, in an in-memory system, since all the
transactions are processed in memory, logging becomes a dominant
cost.

To reduce the logging overhead, a command logging approach [19] was
proposed that records only the transaction information with which a
transaction can be fully replayed after a failure. In H-Store [14],
each command log record stores the ID of the corresponding
transaction and the stored procedure that is applied to update the
database, along with its input parameters. As an example, the command
log records for Figure 1 are simplified as below:

Table 2: Command log

transaction ID   timestamp   procedure pointer   parameters
1                100001      p                   A, B
2                100002      p                   C, G
3                100003      p                   D, B
4                100004      p                   G, D


As all four transactions follow the same routine, we only keep a
pointer $p$ to the details of the stored procedure $f(x, y): y = 2x$.
For recovery purposes, we also need to maintain the parameters of
each transaction, so that the system can re-execute all the
transactions from the last checkpoint when a failure happens.
Compared to an ARIES-style log, a command log is much more compact
and hence reduces the I/O cost of materializing it onto disk. It was
shown that command logging can significantly increase the throughput
of transaction processing in in-memory databases. However, the
improvement is achieved at the expense of recovery performance.
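The difference between the two log types can be made concrete with a small Python sketch. It is only an illustration of the example in Tables 1 and 2, not H-Store code: the snapshot values, the tuple layouts and the function names are assumptions chosen for this sketch.

```python
def f(db, x, y):
    """Stored procedure from the example, y = 2x."""
    db[y] = 2 * db[x]

def replay_command_log(snapshot, command_log):
    """Re-execute every logged transaction in timestamp order."""
    db = dict(snapshot)
    for _txn_id, _ts, proc, params in sorted(command_log, key=lambda r: r[1]):
        proc(db, *params)
    return db

def replay_data_log(snapshot, data_log):
    """Install the logged new values in timestamp order; nothing is re-run."""
    db = dict(snapshot)
    for _ts, _txn_id, key, _old, new in sorted(data_log, key=lambda r: r[0]):
        db[key] = new
    return db

# Hypothetical checkpoint values for the five parameters.
snapshot = {"A": 1, "B": 0, "C": 3, "D": 5, "G": 0}

# (transaction ID, timestamp, procedure, parameters), as in Table 2.
command_log = [
    (1, 100001, f, ("A", "B")),
    (2, 100002, f, ("C", "G")),
    (3, 100003, f, ("D", "B")),
    (4, 100004, f, ("G", "D")),
]

# (timestamp, transaction ID, parameter, old value, new value), as in Table 1.
data_log = [
    (100001, 1, "B", 0, 2),
    (100002, 2, "G", 0, 6),
    (100003, 3, "B", 2, 10),
    (100004, 4, "D", 5, 12),
]

recovered_cmd = replay_command_log(snapshot, command_log)
recovered_data = replay_data_log(snapshot, data_log)
```

Both replays reach the same state, but the command log has to read its inputs and re-run the procedure for every record, while the data log only writes the stored values.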

When there is a node failure, all the transactions have to be
replayed in the command logging approach, while ARIES-style logging
simply recovers the value of each column. (Throughout the paper, we
use "attribute" to refer to a column defined in the schema and
"attribute value" to denote the value of a tuple in a specific
column.) For example, to fully redo $T_1$, command logging needs to
read the value of $A$ and update the value of $B$, whereas with ARIES
logging we just set $B$'s value to $2v(A)$ as recorded in the log
file. More importantly, command logging does not support parallel
recovery in a distributed system. In command logging [19], the
command logs of different nodes are merged at the master node during
recovery, and to guarantee the correctness of recovery, transactions
must be reprocessed in a serialized order based on their timestamps.
For example, even in a network of two nodes, the transactions have to
be replayed one by one due to their possible competition. In the
earlier example, $T_3$ and $T_4$ cannot be concurrently processed by
nodes 1 and 2 respectively, because both transactions need to lock
the value of $D$. In comparison, ARIES-style logging can start the
recovery on nodes 1 and 2 concurrently and independently.

In summary, command logging reduces the I/O cost of processing
transactions, but incurs a much higher cost for recovery than
ARIES-style logging, especially in a distributed environment. To this
end, we propose a new logging scheme which achieves performance
comparable to command logging for processing transactions, while
enabling a much more efficient recovery. Our logging approach also
allows the users to tune the parameters to achieve a preferable
trade-off between transaction processing and recovery.

In this paper, we first propose a distributed version of command
logging. In the recovery process, before redoing the transactions, we
first generate the dependency graph by scanning the log data.
Transactions that read or write the same tuple are linked together. A
transaction can be reprocessed only if all the transactions it
depends on have been processed and committed. On the other hand,
transactions that have no dependency relationship can be processed
concurrently. Based on this principle, we organize transactions into
different processing groups. Transactions inside a group have
dependency relationships, while transactions of different groups can
be processed concurrently.

While the distributed version of command logging effectively exploits
the parallelism among the nodes to speed up recovery, some processing
groups can be rather large, causing a few transactions to block the
processing of many others. We subsequently propose an adaptive
logging approach which adaptively makes use of command logging and
ARIES-style logging. More specifically, we identify the bottlenecks
dynamically based on our cost model and resolve them using ARIES
logging. We materialize the transactions identified as bottlenecks in
the ARIES log, so that the transactions depending on them can be
recovered more efficiently.

It is indeed very challenging to classify transactions into the ones
that may cause a bottleneck and those that will not, because we have
to make a real-time decision on adopting either command logging or
ARIES logging. During transaction processing, we do not know the
impending distribution of transactions. Even if the dependency graph
of impending transactions were known before the processing starts,
the optimization problem of log creation would still be NP-hard.
Hence, a heuristic approach is subsequently proposed to find an
approximate solution based on our model. The idea is to estimate the
importance of each transaction based on the access patterns of
existing transactions.

Finally, we implement our two approaches, namely distributed command
logging and adaptive logging, on top of H-Store and compare them with
ARIES logging and command logging. Our results show that adaptive
logging achieves transaction processing performance comparable to
command logging, while it recovers 10 times faster than command
logging in a distributed system.

The rest of the paper is organized as follows. We present our
distributed command logging approach in Section 2 and the new
adaptive logging approach in Section 3. The experimental results are
presented in Section 4 and we review some related work in Section 5.
The paper is concluded in Section 6.

2. Distributed Command Logging 

As described earlier, command logging only records the transaction
ID, the stored procedure and its input parameters. If some servers
fail, the database can restore the last snapshot and redo all the
transactions in the command log to re-establish the database state.
Command logging operates at a much coarser granularity and writes far
fewer bytes per transaction than ARIES-style logging.

However, the major concern of command logging is its recovery
performance. In VoltDB¹, the command logs of different nodes are
shuffled to the master node, which merges them in timestamp order.
Since command logging does not record how the data are manipulated,
we must redo all transactions one by one, incurring high recovery
overhead. An alternative solution is to maintain multiple replicas
[5, 10, 26, 30], so that data on the failed node can be recovered
from their replicas. However, the drawback of such an approach is
twofold. First, keeping the replicas consistent incurs high
synchronization overhead, further slowing down transaction
processing. Second, given a limited amount of memory, it is too
expensive to maintain replicas in memory. Therefore, in this paper,
we focus on log-based approaches, although we also show the
performance of a replication-based technique in our experimental
study.

Before delving into the details of our approach, we first define the
correctness of recovery in our system. Suppose the data are
partitioned to $N$ cluster nodes. Let $T$ be the set of transactions
since the last checkpoint. For a transaction $t_i \in T$, if $t_i$
reads or writes a tuple on node $n_x \in N$, $n_x$ becomes a
participant of $t_i$. Specifically, we use $f(t_i)$ to return all the
nodes involved in $t_i$, and we use $f^{-1}(n_x)$ to represent all
the transactions in which $n_x$ has participated. In a distributed
system, we assign each transaction a coordinator, typically the node
that minimizes the data transfer for processing the transaction. The
coordinator schedules the data accesses and monitors how its
transaction is processed. Hence, we only need to create a command log
entry at the coordinator. We use $\theta(t_i)$ to denote $t_i$'s
coordinator. Obviously, we have $\theta(t_i) \in f(t_i)$.

Given two transactions $t_i \in T$ and $t_j \in T$, we define an
order relation $\prec$ as: $t_i \prec t_j$ only if $t_i$ is committed
before $t_j$. When a node $n_x$ fails, we need to redo the set of
transactions $f^{-1}(n_x)$. But these transactions may compete for
the same tuple with other transactions. Let $s(t_i)$ and $c(t_i)$
denote the submission time and commit time of $t_i$ respectively.

Definition 1. Transaction Competition

Transaction $t_i$ competes with transaction $t_j$, if

1. $s(t_j) < s(t_i) < c(t_j)$.

2. $t_i$ and $t_j$ read or write the same tuple.

1 http://voltdb.com/ 
Figure 2: A running example

Note that we define competition as a unidirectional relationship:
$t_i$ competes with $t_j$, and $t_j$ may in turn compete with others
that modify the same set of tuples and commit before it. Let
$\odot(t_i)$ be the set of all the transactions that $t_i$ competes
with. We define function $g$ for a transaction set $T_j$ as:

$g(T_j) = \bigcup_{\forall t_i \in T_j} \odot(t_i)$

To recover the database from $n_x$'s failure, we create an initial
recovery set $T_0^x = f^{-1}(n_x)$ and set
$T_{j+1}^x = T_j^x \cup g(T_j^x)$. As we have a limited number of
transactions, we can find an $L$ such that for $j \ge L$, we have
$T_{j+1}^x = T_j^x$. This is because there are no more transactions
accessing the same set of tuples since the last checkpoint. We call
$T_L^x$ the complete recovery set for $n_x$.
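The fixed-point construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming the competition sets $\odot(t_i)$ are already available as a dictionary; the function name and the sample relation (including the pair $t_5$, $t_3$) are hypothetical.

```python
def complete_recovery_set(initial, competes_with):
    """Iterate T_{j+1} = T_j ∪ g(T_j) until no new transaction is added."""
    recovery = set(initial)
    frontier = set(initial)
    while frontier:
        added = set()
        for t in frontier:
            # g(.) is the union of the competition sets of the members.
            added |= competes_with.get(t, set())
        frontier = added - recovery
        recovery |= frontier
    return recovery

# Hypothetical competition sets: t2 competes with t1, t4 with t2, t5 with t3.
competes_with = {"t2": {"t1"}, "t4": {"t2"}, "t5": {"t3"}}
closure = complete_recovery_set({"t4"}, competes_with)
```

Starting from a single transaction `t4`, the loop pulls in `t2` and then `t1`, and leaves the unrelated `t3` and `t5` out.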

Finally, we define the correctness of recovery in the distributed
system as:

Definition 2. Correctness of Recovery

When node $n_x$ fails, we need to redo all the transactions in its
complete recovery set by strictly following their commit order, i.e.,
if $t_i \prec t_j$, then $t_i$ must be reprocessed before $t_j$.

To recover from a node's failure, we need to retrieve its complete
recovery set. For this purpose, we build a dependency graph.

2.1 Dependency Graph 

The dependency graph is defined as a directed acyclic graph
$G = (V, E)$, where each vertex $v_i$ in $V$ represents a transaction
$t_i$ and contains its timestamps ($c(t_i)$ and $s(t_i)$) and its
coordinator $\theta(t_i)$. $v_i$ has an edge $e_{ij}$ to $v_j$, iff

1. $t_i \in \odot(t_j)$

2. $\forall t_m \in \odot(t_j)$ with $t_m \neq t_i$, $c(t_m) < c(t_i)$

That is, among all the transactions that $t_j$ competes with, $t_i$
is the one that commits last. For a specific order of transaction
processing, there is one unique dependency graph, as shown in the
following theorem.

Theorem 1. Given a transaction set $T = \{t_0, \ldots, t_k\}$, where
$c(t_i) < c(t_{i+1})$, we can generate a unique dependency graph for
$T$.

PROOF 1. Since the vertices represent transactions, we always have
the same set of vertices for the same set of transactions. We only
need to prove that the edges are also unique. Based on the
definition, edge $e_{ij}$ exists only if $t_j$ accesses the same set
of tuples that $t_i$ updates and no other transaction that commits
after $t_i$ has that property. For each transaction $t_j$, there is
only one such transaction $t_i$. Therefore, edge $e_{ij}$ is a unique
edge between $t_i$ and $t_j$.

We use Figure 2 as a running example to illustrate the idea. In
Figure 2, there are seven transactions in total since the last
checkpoint, aligned based on their coordinators: node 1 and node 2.
We show transaction IDs, timestamps, stored procedures and parameters
in the table. Based on the definition, transaction $t_2$ competes
with transaction $t_1$, as both of them update $x_1$. Transaction
$t_4$ competes with transaction $t_2$ on $x_3$. The complete recovery
set for $t_4$ is $\{t_1, t_2, t_4\}$. Note that although $t_4$ does
not access the attribute that $t_1$ updates, $t_1$ is still in
$t_4$'s recovery set because of the recursive dependency between
$t_1$, $t_2$ and $t_4$. After constructing the dependency graph and
generating the recovery set, we can adaptively recover the failed
node. For example, to recover node 1, we do not have to reprocess
transactions $t_3$ and $t_5$.

2.2 Processing Group 

In order to generate the complete recovery set efficiently, we
organize transactions into processing groups. Algorithms 1 and 2
illustrate how we generate the groups from a dependency graph. In
Algorithm 1, we start from the root vertex, which represents the
checkpoint, and iterate over all the vertices in the graph. The
neighbors of the root vertex are transactions that do not compete
with the others. We create one processing group for each of them
(lines 3-7). AddGroup is a recursive function that explores all
reachable vertices and adds them to the group. One transaction can
exist in multiple groups if more than one transaction competes with
it.


Algorithm 1 CreateGroup (DependencyGraph G)

1: Set S = ∅
2: Vertex v = G.getRoot()
3: while v.hasMoreEdge() do
4:     Vertex v0 = v.getEdge().endVertex()
5:     Group g = new Group()
6:     AddGroup(g, v0)
7:     S.add(g)
8: return S


Algorithm 2 AddGroup (Group g, Vertex v)

1: g.add(v)
2: while v.hasMoreEdge() do
3:     Vertex v0 = v.getEdge().endVertex()
4:     AddGroup(g, v0)
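Algorithms 1 and 2 translate almost directly into Python. The sketch below is only an illustration under stated assumptions: the dependency graph is represented as a dictionary from a vertex to the list of its successors, the checkpoint root is named `"checkpoint"`, and the sample graph is hypothetical.

```python
def add_group(group, vertex, edges):
    """Algorithm 2: add the vertex and everything reachable from it."""
    group.add(vertex)
    for child in edges.get(vertex, []):
        add_group(group, child, edges)

def create_groups(edges, root="checkpoint"):
    """Algorithm 1: one processing group per neighbor of the root vertex."""
    groups = []
    for first in edges.get(root, []):
        group = set()
        add_group(group, first, edges)
        groups.append(group)
    return groups

# Hypothetical dependency graph rooted at the checkpoint.
edges = {
    "checkpoint": ["t1", "t3"],
    "t1": ["t2"],
    "t2": ["t4"],
    "t3": ["t5"],
}
groups = create_groups(edges)
```

Because a group is a set, a vertex that is reachable along several paths is stored once, and a vertex reachable from two different root neighbors simply appears in both groups.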


2.3 Algorithm for Distributed Command Logging 

When we detect that node $n_x$ fails, we stop the transaction
processing and start the recovery process. One new node starts up to
reprocess all the transactions in $n_x$'s complete recovery set.
Because some transactions are distributed transactions involving
other nodes, the recovery algorithm runs as a distributed process.

Algorithm 3 shows the basic idea of recovery. First, we retrieve all
the transactions that do not compete with the others since the last
checkpoint (line 3). These transactions can be processed in parallel.
Therefore, we find their coordinators and forward them accordingly
for processing. At each coordinator, we invoke Algorithm 4 to process
a specific transaction $t$. We first wait until all the transactions
in $\odot(t)$ are processed. Then, if $t$ has not been processed yet,
we process it and retrieve all its neighbor transactions by following
the links in the dependency graph. If those transactions are also in
the recovery set, we recursively invoke function ParallelRecovery to
process them.

Algorithm 3 Recover (Node n_x, DependencyGraph G)

1: Set S_T = getAllTransactions(n_x)
2: CompleteRecoverySet S = getRecoverySet(G, S_T)
3: Set S_R = getRootTransactions(S)
4: for Transaction t ∈ S_R do
5:     Node n = t.getCoordinator()
6:     ParallelRecovery(n, S_T, t)


Algorithm 4 ParallelRecovery (Node n, Set S_T, Transaction t)

1: while wait(⊙(t)) do
2:     sleep(timethreshold)
3: if t has not been processed then
4:     process(t)
5:     Set S_D = G.getDescendant(t)
6:     for ∀t_i ∈ S_D ∩ S_T do
7:         Node n_i = t_i.getCoordinator()
8:         ParallelRecovery(n_i, S_T, t_i)


Theorem 2. Algorithm 3 guarantees the correctness of the recovery.

PROOF 2. In Algorithm 3, if two transactions $t_i$ and $t_j$ are in
the same processing group and $c(t_i) < c(t_j)$, $t_i$ must be
processed before $t_j$, as we follow the links of the dependency
graph. The complete recovery set of $t_j$ is a subset of the union of
all the processing groups that $t_j$ joins. Therefore, we redo all
the transactions in the recovery set of a specific transaction, as in
Algorithm 3.

As an example, suppose node 1 fails in Figure 2. The recovery set is
$\{t_1, t_2, t_4, t_6, t_7\}$. We first redo $t_1$ on node 2, which
is the only transaction that can run without waiting for the other
transactions. Note that although node 2 does not fail, we still need
to reprocess $t_1$, because it modifies the tuples that are accessed
by the failed transactions. After $t_1$ and $t_2$ commit, we ask the
new node which replaces node 1 to reprocess $t_4$. Simultaneously,
node 2 processes $t_6$ in order to recover $t_7$.
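The scheduling rule behind Algorithms 3 and 4 (a transaction may run once everything it competes with has been replayed) can be simulated sequentially. The following Python sketch is an assumption-laden illustration, not the distributed implementation: it ignores coordinators and network messages, and the competition sets are hypothetical values chosen so that the resulting order matches the narrative above (in particular, the dependency of $t_6$ on $t_2$ is assumed, since Figure 2 is not reproduced here).

```python
def recovery_waves(recovery_set, competes_with):
    """Group transactions into waves; each wave can be replayed concurrently."""
    recovery_set = set(recovery_set)
    done, waves = set(), []
    pending = set(recovery_set)
    while pending:
        # A transaction is ready once all its competitors in the set are done.
        ready = {t for t in pending
                 if (competes_with.get(t, set()) & recovery_set) <= done}
        if not ready:
            raise ValueError("dependency cycle in the log")
        waves.append(sorted(ready))
        done |= ready
        pending -= ready
    return waves

# Hypothetical competition sets for the recovery set of node 1.
competes_with = {"t2": {"t1"}, "t4": {"t2"}, "t6": {"t2"}, "t7": {"t6"}}
waves = recovery_waves({"t1", "t2", "t4", "t6", "t7"}, competes_with)
```

Under these assumptions, `t4` and `t6` land in the same wave, which corresponds to the new node and node 2 working simultaneously.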
(a) ARIES log record: checksum | LSN | record type |
    insert/update/delete | transaction id | partition id |
    table name | primary key | modified column list |
    before image | after image

(b) Footprint log record: checksum | transaction id |
    table ID | tuple ID | ... | table ID | tuple ID

Figure 3: ARIES log vs. footprint log

2.4 Footprint of Transactions 

To reduce the overhead of transaction processing, the dependency
graph is built offline. Before a recovery process starts, we scan the
log to build the dependency graph. For this purpose, we introduce a
lightweight footprint for transactions. The footprint is a specific
type of write-ahead log. Once a transaction is committed, we record
the transaction ID and the IDs of the involved tuples as its
footprint. Figure 3 illustrates the structures of the footprint and
the ARIES log. The ARIES log maintains detailed information about a
transaction, including partition ID, table name, modified column,
original value and updated value, based on which we can successfully
redo a transaction. In contrast, the footprint only records the IDs
of the tuples that are read or updated by a transaction. It incurs
much less storage overhead than the ARIES log (on average, each
record in the ARIES log and the footprint requires 3KB and 450B
respectively) and hence does not significantly affect the performance
of transaction processing. The objective of recording footprints is
not to recover lost transactions, but to build the dependency graph.
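A rough Python sketch of how a footprint scan could yield the dependency graph is given below. It is a simplification made for illustration: competition is approximated as "touches a common tuple and committed earlier" (the submission-time condition of Definition 1 is ignored), each transaction is linked to the most recently committed such predecessor, and the record layout `(txn_id, commit_time, tuple_ids)` is an assumption rather than the on-disk format of Figure 3.

```python
def build_dependency_graph(footprints):
    """Return {vertex: [successors]} from (txn_id, commit_time, tuple_ids)."""
    edges = {}
    last_access = {}  # tuple id -> (commit_time, txn_id) of its latest accessor
    for txn, commit, tuples in sorted(footprints, key=lambda r: r[1]):
        preds = [last_access[t] for t in tuples if t in last_access]
        # Link to the latest-committed predecessor, or to the checkpoint root.
        parent = max(preds)[1] if preds else "checkpoint"
        edges.setdefault(parent, []).append(txn)
        for t in tuples:
            last_access[t] = (commit, txn)
    return edges

# Hypothetical footprints: transaction ID, commit time, accessed tuple IDs.
footprints = [
    ("t1", 1, {"x1"}),
    ("t2", 2, {"x1", "x3"}),
    ("t3", 3, {"x2"}),
    ("t4", 4, {"x3"}),
]
graph = build_dependency_graph(footprints)
```

The scan is a single pass over the footprints in commit order, so its cost is independent of the size of the before and after images that an ARIES log would carry.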

3. Adaptive Logging 

The bottleneck of our distributed command logging is caused by
dependencies among transactions. To ensure causal consistency,
transaction $t$ is blocked until all the transactions in $\odot(t)$
have been processed. If we fully or partially resolve the
dependencies among transactions, the overhead of recovery can be
effectively reduced.

3.1 Basic Idea 

As noted in the introduction, the ARIES log allows each node to
recover independently. If node $n_x$ fails, we just load its log data
since the last checkpoint to redo all updates. We do not need to
consider the dependencies among transactions, because the log
completely records how a transaction modifies the data. Hence, the
intuition of our adaptive logging approach is to combine command
logging and ARIES logging. For transactions that are highly dependent
on the others, we create ARIES logs to speed up their reprocessing.
For other transactions, we apply command logging to reduce the
logging overhead.

For example, in Figure 2, if we create an ARIES log for $t_7$, we do
not need to reprocess $t_6$ to recover node 1. Moreover, if an ARIES
log has been created for $t_2$, we just need to redo $t_2$ and then
$t_4$, and the recovery process does not need to start from $t_1$. In
this case, to recover node 1, only three transactions need to be
re-executed, namely $\{t_2, t_4, t_7\}$. To determine whether a
transaction depends on the results of other transactions, we need a
new relationship, other than transaction competition, that describes
causal consistency.

Definition 3. Time-Dependent Transaction

Transaction $t_j$ is $t_i$'s time-dependent transaction, if 1)
$c(t_i) > c(t_j)$; 2) $t_j$ updates an attribute $a_x$ of tuple $r$
which is accessed by $t_i$; and 3) there is no other transaction with
commit time between $c(t_j)$ and $c(t_i)$ which also updates
$r.a_x$.

Let $\otimes(t_i)$ denote all of $t_i$'s time-dependent transactions.
For the transactions in $\otimes(t_i)$, we can recursively find their
own time-dependent transactions, denoted as
$\otimes^2(t_i) = \otimes(\otimes(t_i))$. This process continues
until we find the minimal $x$ satisfying
$\otimes^x(t_i) = \otimes^{x+1}(t_i)$. $\otimes^x(t_i)$ represents
all the transactions that must run before $t_i$ to guarantee causal
consistency. As a special case, if transaction $t_i$ does not compete
with the others, it does not have time-dependent transactions either
(namely, $\otimes(t_i) = \emptyset$). $\otimes^x(t_i)$ is a subset of
the complete recovery set of $t_i$. Instead of redoing all the
transactions in the complete recovery set, we only need to process
those in $\otimes^x(t_i)$ to guarantee that $t_i$ can be recovered
correctly.

If we adaptively select some transactions in $\otimes^x(t_i)$ for
which to create ARIES logs, we can effectively reduce the recovery
overhead of $t_i$. That is, if we have created an ARIES log for
transaction $t_j$, then $\otimes(t_j) = \emptyset$ and
$\otimes^x(t_j) = \emptyset$, because $t_j$ can now be recovered by
simply loading its ARIES log (in other words, it does not depend on
the results of the other transactions).
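The pruning effect of an ARIES log on the time-dependent closure can be sketched as a graph traversal that stops at ARIES-logged transactions. This Python fragment is illustrative only; the dictionary `time_dep` stands in for the relation of Definition 3 and its contents are hypothetical.

```python
def must_redo(t, time_dep, aries_logged=frozenset()):
    """Transactions that must run before t; ARIES-logged ones cut the chain."""
    needed, stack = set(), [t]
    while stack:
        cur = stack.pop()
        if cur in aries_logged:
            # Restored from its ARIES record, so it has no dependencies.
            continue
        for dep in time_dep.get(cur, set()):
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed

# Hypothetical relation: t4 depends on t2, which in turn depends on t1.
time_dep = {"t4": {"t2"}, "t2": {"t1"}}
without_aries = must_redo("t4", time_dep)
with_aries = must_redo("t4", time_dep, aries_logged={"t2"})
```

With an ARIES log for `t2`, the traversal still returns `t2` (it has to be restored before `t4`), but it no longer reaches `t1`.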

More specifically, let $A = \{a_0, a_1, \ldots, a_m\}$ denote the
attributes that $t_i$ needs to access. These attributes may come from
different tuples. We use $\otimes(t_i.a_x)$ to represent the
time-dependent transactions that have updated $a_x$. Therefore,
$\otimes(t_i) = \otimes(t_i.a_0) \cup \ldots \cup \otimes(t_i.a_m)$.
To formalize how the ARIES log can reduce the recovery overhead, we
introduce the following lemmas.

LEMMA 1. If we have created an ARIES log for $t_j \in \otimes(t_i)$,
transactions $t_l$ in $\otimes^{x-1}(t_j)$ can be discarded from
$\otimes^x(t_i)$, if

$\nexists t_m \in \otimes(t_i),\; t_m = t_l \vee t_l \in \otimes^{x-1}(t_m)$

PROOF 3. The lemma indicates that all the time-dependent transactions
of $t_j$ can be discarded if they are not time-dependent transactions
of the other transactions in $\otimes(t_i)$, which is obviously true.

The above lemma can be further extended to an arbitrary transaction
in $\otimes^x(t_i)$.

LEMMA 2. Suppose we have created an ARIES log for transaction
$t_j \in \otimes^x(t_i)$ which updates attribute set $\bar{A}$.
Transaction $t_l \in \otimes^x(t_j)$ can be discarded, if

$\nexists a_x \in (A - \bar{A}),\; t_l \in \otimes^x(t_i.a_x)$

PROOF 4. Because $t_j$ updates $\bar{A}$, all the transactions in
$\otimes^x(t_j)$ that only update attribute values in $\bar{A}$ can
be discarded without violating the correctness of causal consistency.

The lemma shows that $t_j$'s time-dependent transactions are not
necessary in the recovery process if they are not time-dependent
transactions of any attribute in $(A - \bar{A})$. To recover the
values of attribute set $\bar{A}$ for $t_i$, we can start from
$t_j$'s ARIES log to redo $t_j$ and then all the transactions which
also update $\bar{A}$ and have timestamps in the range
$(c(t_j), c(t_i))$. To simplify the presentation, we use
$\phi(t_j, t_i, t_j.\bar{A})$ to denote these transactions.

Finally, we summarize our observations as the following theorem,
based on which we design our adaptive logging and recovery algorithm.

THEOREM 3. Suppose we have created ARIES logs for transaction set
$T_a$. To recover $t_i$, we need to redo all the transactions in

$\bigcup_{\forall a_x \in (A - \bigcup_{\forall t_j \in T_a} t_j.\bar{A})} \otimes^x(t_i.a_x) \;\cup\; \bigcup_{\forall t_j \in T_a} \phi(t_j, t_i, t_j.\bar{A})$

PROOF 5. The first term represents all the transactions that are
required to recover the attribute values in
$(A - \bigcup_{\forall t_j \in T_a} t_j.\bar{A})$. The second term
denotes all the transactions that we need to redo by recovering from
the ARIES logs and following the timestamp order.


3.2 Logging Strategy

By combining ARIES logging and command logging into a hybrid logging
approach, we can effectively reduce the recovery cost. Given a
limited I/O budget $B_{i/o}$, our adaptive approach selects the
transactions for ARIES logging so as to maximize the recovery
performance. This decision has to be made during transaction
processing, where we determine which type of log to create for each
transaction before it commits. However, since we do not know the
future distribution of transactions, it is impossible to generate an
optimal selection. In fact, even if we knew all the future
transactions, the optimization problem would still be NP-hard.

Let $w^{aries}(t_j)$ and $w^{cmd}(t_j)$ denote the I/O costs of ARIES
logging and command logging for transaction $t_j$ respectively. We
use $r^{aries}(t_j)$ and $r^{cmd}(t_j)$ to represent the recovery
cost of $t_j$ under ARIES logging and command logging respectively.
If we create an ARIES log for transaction $t_j$ that is a
time-dependent transaction of $t_i$, the recovery cost is reduced by:

$\Delta(t_j, t_i) = \sum_{\forall t_x \in \otimes^x(t_j)} r^{cmd}(t_x) - \sum_{\forall t_x \in \phi(t_j, t_i, t_j.\bar{A})} r^{cmd}(t_x) - r^{aries}(t_j)$    (1)

If we decide to create ARIES logs for more than one transaction in
$\otimes^x(t_i)$, $\Delta(t_j, t_i)$ should be updated accordingly.
Let $T_a \subset \otimes^x(t_i)$ be the transactions with ARIES logs.
We define an attribute set:

$\rho(T_a, t_j) = \bigcup_{\forall t_x \in T_a \wedge c(t_x) > c(t_j)} t_x.\bar{A}$

$\rho(T_a, t_j)$ represents the attributes that are updated after
$t_j$ by the transactions with ARIES logs. Therefore,
$\Delta(t_j, t_i)$ is adjusted as

$\Delta(t_j, t_i, T_a) = \sum_{\forall t_x \in \otimes^x(t_j) - T_a} r^{cmd}(t_x) - r^{aries}(t_j) - \sum_{\forall t_x \in \phi(t_j, t_i, t_j.\bar{A} - \rho(T_a, t_j))} r^{cmd}(t_x)$    (2)

Definition 4. Optimization Problem

For transaction $t_i$, find a transaction set $T_a$ for which to
create ARIES logs so that
$\sum_{\forall t_j \in T_a} \Delta(t_j, t_i, T_a)$ is maximized
subject to $\sum_{\forall t_j \in T_a} w^{aries}(t_j) \le B_{i/o}$.

Note that this is a simplified version of the optimization problem,
as we only consider a single transaction for recovery. In real
systems, if node $n_x$ fails, all the transactions in $f^{-1}(n_x)$
should be recovered.

The single-transaction case of the optimization is analogous to the
0-1 knapsack problem, while the more general case is similar to the
multi-objective knapsack problem. It becomes even harder when
function $\Delta$ is also determined by the correlations of
transactions.

3.2.1 Offline Algorithm

We first introduce our offline algorithm designed for the ideal case,
where the impending distribution of transactions is known. The
offline algorithm is only used to demonstrate the basic idea of
adaptive logging, while our system employs its online variant. We use
$T$ to represent all the transactions from the last checkpoint to the
point of failure.

For each transaction $t_i \in T$, we compute its benefit as:

$b(t_i) = \sum_{\forall t_j \in T \wedge c(t_i) < c(t_j)} \Delta(t_i, t_j, T_a) \times \frac{1}{w^{aries}(t_i)}$

Initially, $T_a = \emptyset$.

We sort the transactions based on their benefit values. The one with
the maximal benefit is selected and added to $T_a$. All the other
transactions then update their benefits accordingly based on
Equation 2. This process continues as long as

$\sum_{\forall t_j \in T_a} w^{aries}(t_j) < B_{i/o}$

Algorithm 5 outlines the basic idea of the offline algorithm. Since
we need to re-sort all the transactions after each update to $T_a$,
the complexity of the algorithm is $O(N^2)$, where $N$ is the number
of transactions. In fact, full sorting is not necessary in most
cases, because $\Delta(t_i, t_j, T_a)$ needs to be recalculated only
if both $t_i$ and $t_j$ update a value of the same attribute.
Algorithm 5 Offline (TransactionSet T)

1: Set T_a = ∅, Map benefits
2: for ∀t_i ∈ T do
3:     benefits[t_i] = computeBenefit(t_i)
4: while getTotalCost(T_a) < B_i/o do
5:     sort(benefits)
6:     T_a.add(benefits.keys().first())
7: return T_a
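The greedy selection of Algorithm 5 can be sketched in Python as follows. Several details are assumptions made for this illustration: the benefit is supplied as a callable `benefit(t, chosen)` that stands in for $b(t_i)$ and may depend on the current $T_a$, the maximum is taken directly instead of re-sorting, and the loop stops before the budget would be exceeded (Algorithm 5 as printed checks the budget only before each addition). The toy costs and savings are hypothetical.

```python
def offline_select(transactions, benefit, aries_cost, budget):
    """Greedy: repeatedly add the highest-benefit transaction within budget."""
    chosen, spent = [], 0.0
    remaining = set(transactions)
    while remaining:
        # Benefits are recomputed because they depend on what is already chosen.
        best = max(sorted(remaining), key=lambda t: benefit(t, chosen))
        if spent + aries_cost[best] > budget:
            break
        chosen.append(best)
        spent += aries_cost[best]
        remaining.remove(best)
    return chosen

# Hypothetical recovery savings and ARIES write costs per transaction.
saved = {"t1": 5.0, "t2": 9.0, "t3": 2.0}
cost = {"t1": 3.0, "t2": 3.0, "t3": 3.0}

def toy_benefit(t, chosen):
    """Placeholder for b(t): saved recovery cost per unit of I/O cost."""
    return saved[t] / cost[t]

selected = offline_select(["t1", "t2", "t3"], toy_benefit, cost, budget=6.0)
```

With a budget of 6 and a cost of 3 per record, two transactions fit, and the sketch picks the two with the largest saving per unit of I/O.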


3.2.2 Online Algorithm 

Our online algorithm is similar to the offline version, except that
we must choose either ARIES logging or command logging in real time.
Since we have no knowledge about the distribution of future
transactions, we use a histogram to approximate the distribution. In
particular, for all the attributes $A = (a_0, \ldots, a_k)$ involved
in transactions, we record the number of transactions that read or
write a specific attribute value, and use the histogram to estimate
the probability of accessing an attribute $a_i$, denoted as $P(a_i)$.
Note that attributes in $A$ may come from the same tuple or from
different tuples. For tuples $v_0$ and $v_1$, if both $v_0.a_i$ and
$v_1.a_i$ appear in $A$, we represent them as two different
attributes.
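A minimal Python sketch of such a histogram is shown below. It assumes that $P(a_i)$ is estimated as the fraction of observed transactions that accessed $a_i$, and that an attribute is identified by a `(tuple ID, column)` pair so that the same column in two tuples counts as two attributes; the class and method names are hypothetical.

```python
from collections import Counter

class AccessHistogram:
    """Counts, per attribute, how many transactions read or wrote it."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def record(self, attributes):
        """Call once per committed transaction with the attributes it touched."""
        self.counts.update(set(attributes))  # each attribute counted once
        self.total += 1

    def probability(self, attribute):
        """Estimated P(a_i): share of transactions that accessed the attribute."""
        return self.counts[attribute] / self.total if self.total else 0.0

hist = AccessHistogram()
hist.record([("v0", "a"), ("v1", "a")])
hist.record([("v0", "a")])
```

Here `("v0", "a")` and `("v1", "a")` are tracked separately even though they refer to the same column.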

As defined in Section 3.1, $\phi(t_j, t_i, t_j.\bar{A})$ denotes the
transactions that commit between $t_j$ and $t_i$ and also update some
attributes in $t_j.\bar{A}$. In fact, we can rewrite it as:

$\phi(t_j, t_i, t_j.\bar{A}) = \bigcup_{\forall a_i \in t_j.\bar{A}} \phi(t_j, t_i, a_i)$


Similarly, let $S = t_j.\bar{A} - \rho(T_a, t_j)$. The third term of
Equation 2 can be computed as:

$\sum_{\forall t_x \in \phi(t_j, t_i, S)} r^{cmd}(t_x) = \sum_{\forall a_x \in S} \Big( \sum_{\forall t_x \in \phi(t_j, t_i, a_x)} r^{cmd}(t_x) \Big)$

We use a constant $R^{cmd}$ to denote the average recovery cost of
command logging. The above equation can then be simplified as:

$\sum_{\forall t_x \in \phi(t_j, t_i, S)} r^{cmd}(t_x) = \sum_{\forall a_x \in S} \big( P(a_x) R^{cmd} \big)$    (3)

The first term of Equation 2, which estimates the cost of
recovering $t_j$'s time-dependent transactions using command
logging, can be efficiently computed in real time if we maintain
the dependency graph. Therefore, by combining Equations 2 and 3,
we can estimate the benefit $b(t_i)$ of a specific transaction
during online processing. Suppose we have already created
Aries logs for the transactions in $T_a$; the benefit should
then be updated based on Equation 2.
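
The histogram and the simplified sum of Equation 3 can be sketched
as follows. This is an illustration only: the class name
`AccessHistogram`, the representation of an attribute as a
`(tuple id, attribute name)` pair, and the choice to estimate
$P(a)$ as the fraction of observed transactions that touch $a$ are
assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Iterable


class AccessHistogram:
    """Counts, per attribute, how many transactions read or wrote it."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.num_txns = 0

    def record(self, attributes: Iterable[Hashable]) -> None:
        # One call per transaction; each attribute is counted at most once.
        self.num_txns += 1
        self.counts.update(set(attributes))

    def prob(self, attr: Hashable) -> float:
        # Assumed estimator of P(a): share of transactions accessing attr.
        return self.counts[attr] / self.num_txns if self.num_txns else 0.0


def expected_cmd_recovery_cost(hist: AccessHistogram,
                               attrs: Iterable[Hashable],
                               r_cmd: float) -> float:
    """Right-hand side of Equation 3: sum over a_x in S of P(a_x) * R^cmd."""
    return sum(hist.prob(a) * r_cmd for a in attrs)
```

Keying the histogram by `(tuple id, attribute name)` keeps
`v0.price` and `v1.price` apart, matching the note that the same
attribute of two tuples is treated as two different attributes.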

The last problem is how to define a threshold $\gamma$. When the
benefit of a transaction is greater than $\gamma$, we create an
Aries log for it. Let us consider the ideal case. Suppose the
node fails while processing $t_i$, for which we have just created
an Aries log. This log achieves the maximal benefit, which can
be estimated as:

$$b^{opt} = \Big( N R^{cmd} \sum_{\forall a_x \in A} P(a_x) - R^{aries} \Big) \times \frac{1}{W^{aries}}$$

where $N$ denotes the number of transactions before $t_i$, and
$R^{aries}$ and $W^{aries}$ are the average recovery cost and I/O
cost of an Aries log, respectively.

Suppose failures happen arbitrarily, following a Poisson
distribution with parameter $\lambda$. That is, the expected
average failure time is $\lambda$. Let $p(s)$ be the function that
returns the number of committed transactions in $s$. Before a
failure, there are approximately $p(\lambda)$ transactions. So the
maximal possible benefit is:

$$b^{opt} = \Big( p(\lambda) R^{cmd} \sum_{\forall a_x \in A} P(a_x) - R^{aries} \Big) \times \frac{1}{W^{aries}}$$

We define our threshold as $\gamma = \alpha b^{opt}$, where
$\alpha$ is a tunable parameter.
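
As a worked illustration of the threshold, the two formulas can be
evaluated directly. The function and parameter names below are
hypothetical, and the numbers in the usage note are made up for the
example; they are not measurements from our experiments.

```python
from typing import Iterable


def optimal_benefit(num_txns: float, r_cmd: float, r_aries: float,
                    w_aries: float, access_probs: Iterable[float]) -> float:
    """b_opt = (N * R^cmd * sum of P(a_x) - R^aries) / W^aries."""
    return (num_txns * r_cmd * sum(access_probs) - r_aries) / w_aries


def threshold(alpha: float, b_opt: float) -> float:
    """gamma = alpha * b_opt, with alpha a tunable parameter."""
    return alpha * b_opt
```

For instance, with 100 transactions expected before a failure,
`r_cmd = 2.0`, `r_aries = 1.0`, `w_aries = 4.0` and two attributes
accessed with probability 0.25 each, `optimal_benefit` gives
(100 x 2.0 x 0.5 - 1.0) / 4.0 = 24.75, and `alpha = 0.5` gives a
threshold of 12.375.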

Given a limited I/O budget $B_{i/o}$, we can create approximately
$B_{i/o}/W^{aries}$ Aries log records. As failures may happen
randomly at any time, the log records should be evenly
distributed over the timeline. More specifically, the cumulative
distribution function (CDF) of the Poisson distribution is

$$P(\text{fail-time} < k) = e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}$$

Hence, at the $k$th second, we can maximally create

$$quota(k) = P(\text{fail-time} < k) \, \frac{B_{i/o}}{W^{aries}}$$

log records. As time elapses, we should check whether
we still have quota for Aries logs. If not, we will not
create any new Aries log for the time being.
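
The quota function can be sketched in a few lines of Python. The
names `poisson_cdf`, `quota`, `io_budget` and `w_aries` are chosen
for this example; the CDF is the standard Poisson sum shown above,
accumulated term by term to avoid computing large factorials.

```python
import math


def poisson_cdf(k: float, lam: float) -> float:
    """e^{-lam} * sum_{i=0}^{floor(k)} lam^i / i!  (0 for negative k)."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # the i = 0 term
    total = term
    for i in range(1, math.floor(k) + 1):
        term *= lam / i    # lam^i / i! derived from the previous term
        total += term
    return total


def quota(k: float, lam: float, io_budget: float, w_aries: float) -> float:
    """Maximum number of ARIES log records allowed up to the k-th second."""
    return poisson_cdf(k, lam) * io_budget / w_aries
```

Because the CDF is non-decreasing and approaches 1, the quota
grows over time and never exceeds `io_budget / w_aries`.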

Finally, we summarize the idea of the online adaptive logging
scheme in Algorithm 6.

Algorithm 6 Online (Transaction t_i, int usedQuota)

1: int q = getQuota(s(t_i)) - usedQuota
2: if q > 0 then
3:     Benefit b = computeBenefit(t_i)
4:     if b > γ then
5:         usedQuota++
6:         createAriesLog(t_i)
7:     else
8:         createCommandLog(t_i)
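
The decision rule of Algorithm 6 can be sketched as below. The
class `OnlineLogger` and its callbacks are hypothetical names for
this example. One assumption goes beyond the pseudocode: when no
quota is left, the sketch falls back to command logging, so that
every transaction is logged in some form.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass
class OnlineLogger:
    get_quota: Callable[[float], float]           # quota available at a given time
    compute_benefit: Callable[[Hashable], float]  # hypothetical benefit estimator
    gamma: float                                  # threshold on the benefit
    used_quota: int = 0

    def choose(self, txn: Hashable, now: float) -> str:
        """Return "aries" or "command" for one incoming transaction."""
        remaining = self.get_quota(now) - self.used_quota
        if remaining > 0 and self.compute_benefit(txn) > self.gamma:
            self.used_quota += 1
            return "aries"
        # Assumed fallback: no quota left, or benefit not above gamma.
        return "command"
```

The benefit is only computed while quota remains, because the
`and` short-circuits, which matches the nesting of the pseudocode.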


3.3 In-Memory Index 

To help compute the benefit of each transaction, we create
an in-memory inverted index in our master node. Figure 4
shows the structure of the index. The index data are
organized by table ID and tuple ID. For each specific tuple, we
record the transactions that read or write its attributes. As
an example, in Figure 4, transaction $t_2$ reads the number of
tuple 10001 and updates the price of tuple 10023.

Figure 4: In-memory index. Entries are grouped by table (Order,
Part) and tuple (10001, 10023, 83103); each attribute (price,
number, discount, type) points to a list of records of the form
(transaction, timestamp, R/W).

Using the index, we can efficiently retrieve the time-dependent
transactions. For transaction $t_5$, let $A_5$ be the
attributes that it accesses. We search the index to retrieve
all transactions that update any attribute in $A_5$ before $t_5$.
In Figure 4, because the discount value of tuple 10023 is updated
by $t_5$, we check its list and find that $t_3$ updates the same
value before $t_5$. Therefore, $t_3$ is a time-dependent
transaction of $t_5$. In fact, the index can also be employed to
recover the dependency graph of transactions. We omit the details
as it is quite straightforward.
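
The lookup described above can be illustrated with a small Python
sketch. The class `AccessIndex`, its method names, and the use of
plain numbers as timestamps are assumptions for the example; only
the organization by table, tuple and attribute follows Figure 4.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

AttrKey = Tuple[str, int, str]     # (table, tuple id, attribute)
Access = Tuple[float, str, str]    # (timestamp, transaction, "R" or "W")


class AccessIndex:
    """Inverted index from an attribute to the transactions that touched it."""

    def __init__(self) -> None:
        self._lists: Dict[AttrKey, List[Access]] = defaultdict(list)

    def record(self, txn: str, timestamp: float, key: AttrKey, mode: str) -> None:
        self._lists[key].append((timestamp, txn, mode))

    def time_dependent(self, keys: Iterable[AttrKey], before: float) -> Set[str]:
        """Transactions that updated any of `keys` earlier than `before`."""
        return {
            txn
            for key in keys
            for ts, txn, mode in self._lists.get(key, [])
            if mode == "W" and ts < before
        }
```

With the accesses of Figure 4 recorded, querying the discount of
tuple 10023 with the timestamp of $t_5$ returns $t_3$, while reads
such as the one by $t_2$ on tuple 10001 are ignored.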

4. Experimental Evaluation 

In this section, we conduct a runtime cost analysis of our
proposed adaptive logging and compare its query processing
and recovery performance against that of other approaches.
Since both traditional Aries logging and command logging are
already supported by H-Store, for consistency, we implement
the distributed command logging and adaptive logging
approaches on top of H-Store as well. In summary, we have
the following four approaches:

• ARIES - Aries logging.

• Command - command logging proposed in [19].

• Dis-Command - distributed command logging approach.

• Adapt-x% - adaptive logging approach, where we create
Aries logs for x% of distributed transactions that
involve multiple compute nodes. When x=100, adaptive
logging adopts a simple strategy: Aries logging
for all distributed transactions and command logging for
all single-node transactions.

All the experimental evaluations are conducted on our
in-house cluster of 17 nodes. The head node is a powerful
server equipped with an Intel(R) Xeon(R) 2.2 GHz 48-core
CPU and 64GB RAM, and the compute nodes are blade
servers with an Intel(R) Xeon(R) 1.8 GHz 8-core CPU and
16GB RAM. H-Store is deployed on the 16 compute nodes
by partitioning the databases evenly. Each node runs a
transaction site. By default, only 8 sites in H-Store are used,
except in the scalability test. We use the TPC-C benchmark²
with 100 clients running concurrently on the head node, each
submitting its transaction requests one by one. As H-Store
does not support replication, we measure the effect of
replication using its commercial version, VoltDB, with the
Voter benchmark³ [19].

4.1 Runtime Cost Analysis 

We first compare the overheads of different logging
strategies at runtime. In this experiment, we use the number
of New-Order transactions processed per second as the
metric to evaluate the effect of different logging methods on
the throughput of the system. To illustrate the behaviors of
different logging techniques, we adopt two workloads: one
workload contains only local transactions, while the other
contains both local and distributed transactions.

4.1.1 Throughput Evaluation 

Figure 5 shows the throughput of different approaches when
only local transactions are involved and we vary the client
rate, namely the total number of transactions submitted by
all client threads per second. When the client rate is low, the
system is not saturated and all incoming transactions can be
completed within a bounded waiting time. Although different
logging approaches incur different I/O costs, all logging
approaches show fairly similar performance because I/O is not
the bottleneck. However, as the client rate increases, the
system with Aries logging saturates the earliest, at an input
rate of around 20,000 transactions per second. The other
approaches (i.e., adaptive logging, distributed command logging
and command logging), on the other hand, reach the saturation
point at around 30,000 transactions per second, which is
slightly lower than the ideal case (represented by the
no-logging approach). The throughput of distributed command
logging is slightly lower than that of command logging,
primarily due to the overhead of the extra book-keeping
involved in distributed command logging.

Figure 6 shows the throughput variation (with log scale
on the y-axis) when there exist distributed transactions. We set
the client rate to 30,000 transactions per second to keep all
sites busy and vary the percentage of distributed transactions
from 0% to 50%, so that the system performance is
affected by both network communication and logging. To
process distributed transactions, multiple sites have to
cooperate with each other, and as a result, the coordination
cost typically increases with the number of participating sites.
To guarantee correctness at the commit phase, we use the
two-phase commit protocol, which is supported by H-Store.
In contrast to the local processing shown in Figure 5, the
main bottleneck of distributed processing gradually
shifts from logging cost to communication cost. Compared
to a local transaction, a distributed transaction always incurs
extra network overhead, which makes the effect of logging less
significant.

2 http://www.tpc.org/tpcc/

3 http://hstore.cs.brown.edu/documentation/deployment/benchmarks/voter/

Figure 5: Throughput without distributed transactions

Figure 6: Throughput with distributed transactions (with log
scale on y-axis)

Figure 7: Latency without distributed transactions

Figure 8: Latency with distributed transactions (with log
scale on y-axis)

As shown in Figure 6, when the percentage of distributed
transactions is less than 30%, the throughput of the other
logging strategies is still 1.4x better than that of Aries
logging. In this experiment, the threshold x of adaptive logging
is set to 100%, where we create Aries logs for all distributed
transactions. The purpose is to test the worst-case performance
of adaptive logging.

Command logging is claimed to be more suitable for local
transactions with multiple updates, and Aries logging is
preferred for distributed transactions with few updates [19].
This claim is true in general. However, since the workload
does change over time, neither command logging nor Aries
logging can fully satisfy all access patterns. On the other
hand, our proposed adaptive logging has been designed to
adapt to the real-time variability in workload characteristics.

4.1.2 Latency Evaluation 

Latency typically exhibits a trend similar to that of
throughput, but in the opposite direction, and the average
latency of the different logging strategies is expected to
increase as the client rate increases. Figure 7 shows that the
latency of distributed command logging is slightly higher than
that of command logging. However, it still performs much better
than Aries logging. Like other OLTP systems, H-Store first
buffers the incoming transactions in a transaction queue. The
transaction engine pulls them out and processes them one by one.
H-Store adopts a single-thread mechanism, in which each thread
is responsible for one partition in order to reduce the overhead
of concurrency control. When the system becomes saturated,
newly arrived transactions need to wait in the queue, which
leads to a higher latency.

Transactions usually commit at the server side, which
sends response information back to the client. Before
committing, all log entries are flushed to disk. The proposed
distributed command logging materializes command log entries
and footprint information before the transactions commit.
When a transaction completes, it compresses the footprint
information, which contributes to a slight delay in response.
However, the penalty becomes negligible when many distributed
transactions are involved. Distributed transactions usually
incur a higher latency due to the extra network overhead. In
our experiments, group commit is enabled to optimize disk I/O
utilization. With an increasing number of distributed
transactions, the latency is less affected by logging, and all
approaches show similar performance, as shown in Figure 8.

4.1.3 Online Algorithm Cost of Adaptive Logging 

Figure 9 shows the overhead of the online algorithm. We
analyze the computation cost of every minute by showing the
percentages of time taken for making online decisions and
for processing transactions. The overhead of the online
algorithm increases when the system runs for a longer time,
because more transaction information is maintained in the
in-memory index. However, we observe that it takes only 5
seconds to execute the online algorithm in the 8th minute;
the bulk of the time is still spent on transaction processing.
Further, the online decision cost will not grow without bound,
as it is bounded by the checkpointing interval.
Since the online decision is made before the execution of
a transaction, we could overlap the computation with the time
the transaction waits in the transaction queue to further
reduce the latency.


4.1.4 Effect of Replication 

Replication is often used to ensure database correctness,
improve performance, and provide high availability. However,
it also incurs high synchronization overhead to achieve
strong consistency. Figure 10a shows the results with different
numbers of replicas, where the number of working execution
sites is fixed to 8. With a fixed number of sites, creating
more replicas increases the workload per node, and more
computation resources (e.g., CPU and memory) are used.
The performance drops by about 37% when there are three
replicas. If available resources are limited, the system's
performance is very sensitive to the number of replicas. In
Figure 10b, we increase the number of sites to 24. Namely, $N$
sites maintain the original data, while the other $24 - N$ sites
handle the replicas. In this case, each site will have
the same workload as in the non-replication case. However, we
find that if 3 replicas are enabled, the performance degrades
by 72.5% when compared to no replication.

Figure 9: Cost of online decision algorithm

Figure 10: Effects of replication. (a) Replication with 8
working sites; (b) Replication with 24 unique sites (x-axis:
number of replicas)

4.2 Recovery Cost Analysis 

In this experiment, we evaluate the recovery performance 
of different logging approaches. We simulate two scenarios. 
In the first scenario, we run the system for one minute to 
process the transactions and then shut down an arbitrary site 
to simulate the recovery process. In the second scenario, 
each site will process 30,000 transactions before the process 
of a random site is terminated forcibly so that the recovery 
algorithm can be invoked. In both scenarios, we measure the 
elapsed time to recover the failed site. 



Figure 11: 1 minute after the last checkpoint (x-axis:
percentage of distributed transactions)

Figure 12: After 30,000 transactions committed at each site
(x-axis: percentage of distributed transactions)

4.2.1 Recovery Evaluation 

Except for Aries logging, the recovery times of the other
methods are affected by two factors: the number of committed
transactions and the percentage of distributed transactions.
Figures 11 and 12 summarize the recovery times of
the four logging approaches. Intuitively, the recovery time is
proportional to the number of transactions that must be
reprocessed after a failure. In Figure 11, we note that fewer
transactions can be completed within a given time as the
percentage of distributed transactions is increased. So even
though recovering a distributed transaction is costlier, with
an increased percentage of distributed transactions, fewer
transactions are processed per unit of time. Figure 11
demonstrates this trade-off in that the percentage of
distributed transactions does not adversely affect the
recovery times, since the cost of recovering distributed
transactions is offset by the reduction in the number of
distributed transactions in a fixed unit of time. For the
experiment shown in Figure 12, when we require all sites to
complete at least 30,000 transactions, a higher recovery cost
is observed as the percentage of distributed transactions
increases.

In all cases, Aries logging shows the best performance
and is not affected by the percentage of distributed
transactions, while command logging is always the worst. Our
distributed command logging significantly reduces the
recovery overhead of command logging, achieving a 5x
improvement. Adaptive logging further improves the
performance by tuning the trade-off between recovery cost and
transaction processing cost, as discussed below.

ARIES logging supports independent parallel recovery,
since each Aries log entry contains one tuple's data
image before and after each operation. Intuitively, the recovery
time of Aries logging should be less than the time interval
between checkpointing and the failure time, since read
operations and transaction logic do not need to be repeated
during the recovery. As a fine-grained logging approach, Aries
logging is not affected by the percentage of distributed
transactions or the workload skew. The recovery time is
typically proportional to the number of committed transactions.

Command logging incurs much higher overhead when
performing a recovery process involving distributed
transactions (even for a small portion, say 5%). This
observation can be explained by Figure 13, which shows the
recovery time of command logging with one failed site that has
30,000 committed transactions since the last checkpoint. The
ideal performance of command logging is achieved by redoing
all transactions in all sites without any synchronization.
Of course, this results in an inconsistent state, and we only
use it here to underscore the overhead of synchronization. If
no distributed transaction is involved, command logging can
provide performance similar to that of the other schemes,
because dependencies can be resolved within each site.
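
The difference between the two redo styles can be made concrete
with a toy example. The sketch below is a simplification, not the
ARIES protocol or the H-Store implementation: log sequence numbers,
undo and checkpoints are omitted, and the record layouts, the `add`
procedure and the function names are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Database = Dict[str, int]
DataRecord = Tuple[str, int, int]        # (key, before image, after image)
CommandRecord = Tuple[str, tuple]        # (procedure name, parameters)


def redo_data_log(db: Database, log: List[DataRecord]) -> None:
    """Data-log redo: install after images; no transaction logic is rerun."""
    for key, _before, after in log:
        db[key] = after


def redo_command_log(db: Database, log: List[CommandRecord],
                     procedures: Dict[str, Callable[..., None]]) -> None:
    """Command-log redo: re-execute every logged procedure in commit order."""
    for name, params in log:
        procedures[name](db, *params)


def add(db: Database, key: str, delta: int) -> None:
    """Example stored procedure: read the current value, then update it."""
    db[key] = db.get(key, 0) + delta
```

Both functions bring a checkpoint `{"x": 1}` to the same state, but
the command log must rerun the read inside `add`, which is why its
replay depends on the order of earlier transactions, possibly at
other sites.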
Table 3: Number of reprocessed transactions

Figure 13: Synchronization cost of command logging

Figure 14: Recovery performance with varying x


Distributed command logging effectively reduces the
recovery time compared to command logging, as shown
in Figure 11. On the other hand, Figure 12 shows that the
performance of distributed command logging is less sensitive
to the percentage of distributed transactions when compared
to command logging. One additional overhead of distributed
command logging is the cost of scanning the footprints to
build the dependency graph. For a 1-minute workload, the time
for building the dependency graph increases from 2s to
5s when the percentage of distributed transactions ranges
from 5% to 25%. Compared to the total recovery cost, the
time for building the dependency graph is fairly negligible.
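
To illustrate the footprint scan, the following sketch builds a
dependency graph in one pass and collects the transactions that a
recovery would have to redo. The footprint layout (read set and
write set per transaction, listed in commit order) and all names
are assumptions for the example; the actual footprint format is not
reproduced here.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Set


class Footprint(NamedTuple):
    txn: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def build_dependency_graph(footprints: Iterable[Footprint]) -> Dict[str, Set[str]]:
    """Map each transaction to the latest earlier writers of the keys it touches."""
    last_writer: Dict[str, str] = {}
    deps: Dict[str, Set[str]] = {}
    for fp in footprints:  # assumed to be in commit order
        deps[fp.txn] = {last_writer[k] for k in fp.reads | fp.writes
                        if k in last_writer}
        for k in fp.writes:
            last_writer[k] = fp.txn
    return deps


def transactions_to_redo(deps: Dict[str, Set[str]], lost: Iterable[str]) -> Set[str]:
    """Lost transactions plus everything they transitively depend on."""
    todo: List[str] = list(lost)
    seen: Set[str] = set()
    while todo:
        t = todo.pop()
        if t not in seen:
            seen.add(t)
            todo.extend(deps.get(t, ()))
    return seen
```

Transactions outside the returned set touch no data that the lost
transactions depend on, so they need not be reprocessed, which is
the effect reported for distributed command logging.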

The adaptive logging technique selectively builds Aries
logs and command logs. To reduce the I/O overhead of adaptive
logging, we set a threshold $B_{i/o}$ in the online algorithm,
so at most $N = B_{i/o}/W^{aries}$ Aries logs
can be created. In this experiment, we use a dynamic threshold,
by setting $N$ as x percent of the total number of
distributed transactions. In Figures 11 and 12, x is set to
40%, 60%, or 100% to tune the recovery cost and transaction
processing cost. In all the tests, the recovery performance of
adaptive logging is much better than that of command logging,
and only slightly worse than that of pure Aries logging. As
x increases, more Aries logs are created by adaptive logging,
which reduces the recovery time. In the extreme case, we create
an Aries log for every distributed transaction by setting
x = 100%. Then, all dependencies of distributed transactions
are resolved using Aries logs, and each site can process its
recovery independently of the others.

Figure 14 shows the effect of x on the recovery
performance. We vary the percentage of distributed
transactions and show the results with different x values. When
x = 100%, the recovery times are the same for all, independent
of the percentage of distributed transactions, because all
dependencies have been resolved. In contrast, adaptive
logging degrades to distributed command logging if we
set x = 0. In this case, more distributed transactions result
in a higher recovery cost.

Table 3 shows the number of transactions that are reprocessed
during the recovery in Figure 12. Compared to command logging,
distributed command logging and adaptive logging substantially
reduce the number of transactions that need to be reprocessed.


Percentage  Command  Dis-Command  Adapt-40%  Adapt-60%  Adapt-100%
0%          30031    30015        30201      30087      30076
5%          239321   35642        33742      32483      29290
10%         240597   39285        36054      34880      30674
15%         240392   42979        39687      37496      32201
20%         239853   48132        43808      40912      33994
25%         240197   57026        50465      46095      35617
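
The reduction can be quantified directly from the counts in
Table 3. The short script below divides the number of transactions
reprocessed by command logging by the number reprocessed by each
other approach; the variable and function names are chosen for this
example, and the data are copied from the table.

```python
# Table 3: percentage of distributed transactions -> reprocessed transactions.
COLUMNS = ["Command", "Dis-Command", "Adapt-40%", "Adapt-60%", "Adapt-100%"]
TABLE3 = {
    0: [30031, 30015, 30201, 30087, 30076],
    5: [239321, 35642, 33742, 32483, 29290],
    10: [240597, 39285, 36054, 34880, 30674],
    15: [240392, 42979, 39687, 37496, 32201],
    20: [239853, 48132, 43808, 40912, 33994],
    25: [240197, 57026, 50465, 46095, 35617],
}


def reduction_factors(percentage: int) -> dict:
    """Command-logging count divided by the count of each other approach."""
    row = dict(zip(COLUMNS, TABLE3[percentage]))
    baseline = row["Command"]
    return {name: baseline / count
            for name, count in row.items() if name != "Command"}


if __name__ == "__main__":
    for pct in TABLE3:
        factors = reduction_factors(pct)
        print(f"{pct}%", {name: round(f, 1) for name, f in factors.items()})
```

At 5% distributed transactions, for example, command logging
reprocesses roughly 6.7 times as many transactions as distributed
command logging, while at 0% all approaches reprocess about the
same number.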


4.2.2 Overall Performance Evaluation 


The intuition behind the adaptive logging approach is to
balance the trade-off between recovery and transaction
processing time. It is widely expected that when commodity
servers are used in large numbers, failures are no longer an
exception [?]. That is, the system must be able to recover
efficiently when a failure occurs and provide good overall
performance. In this set of experiments, we measure the overall
performance of the different approaches. In particular, we run
the system for three hours and intentionally shut down a
random node based on a predefined failure rate. The system
iteratively processes transactions and performs recovery, and a
new checkpoint is created every 10 minutes. The total
throughput of the entire system is then computed as the
average number of transactions processed per second over the
three hours.

We show the total throughput for varying failure rates in
Figures 15a to 15c, with three different mixes of distributed
transactions. Aries logging is superior to the other
approaches when the failure rate is very high (e.g., one
failure every 5 minutes). When the failure rate is low,
distributed command logging shows the best performance,
because it is just slightly slower than command logging for
transaction processing, but recovers much faster than command
logging. As the failure rate drops, the Adapt-100% approach
cannot provide performance comparable to command logging,
because Adapt-100% creates an Aries log for every distributed
transaction, which is too costly in transaction processing.


4.2.3 Scalability 

In this experiment, we evaluate the scalability of our
proposed approaches. In Figure 16, each site processes at least
30,000 transactions before we randomly terminate one site
(other sites will detect it as a failed site). The percentage
of distributed transactions is 10%, and they are uniformly
distributed among all sites. We observe that command logging
is not scalable, as the recovery time is linear in the number
of sites, because all sites need to reprocess their lost
transactions. The recovery cost of distributed command logging
increases by about 50% when we increase the number of sites
from 2 to 16. The other logging approaches show scalable
recovery performance. Adaptive logging selectively creates
Aries logs to break dependency relations among compute
nodes, so the number of transactions that must be reprocessed
during recovery is greatly reduced.
(a) Overall throughput with 5% distributed transactions

(b) Overall throughput with 10% distributed transactions


5. Related Work 

ARIES [21] logging is widely adopted for recovery in
traditional disk-based database systems. As a fine-grained
logging strategy, Aries logging needs to construct log records
for each modified tuple. Similar techniques are applied to
in-memory database systems [7, 11, 12, 32].

In [19], the authors argue that for in-memory systems,
since the whole database is maintained in memory, the
overhead of Aries logging cannot be ignored. They propose a
different kind of coarse-grained logging strategy called
command logging. It only records a transaction's name and
parameters instead of concrete tuple modification information.

Aries log records contain the tuples' old values and new
values. DeWitt et al. [6] try to reduce the log size by only
writing the new values to log files. However, log records
without old values cannot support undo operations, so the
approach needs stable memory large enough to hold the complete
log records for active transactions. They also try to write log
records in batches to optimize disk I/O performance. Similar
techniques such as group commit [8] are also explored in
modern database systems.

Systems [?] with an asynchronous commit strategy allow
transactions to complete without waiting for log writing
requests to finish. This strategy can reduce the overhead of
log flushes to an extent, but it sacrifices the database's
durability, since the states of committed transactions can be
lost when failures happen.

Lomet et al. [18] propose a logical logging strategy. The
recovery phase of Aries logging combines physiological
redo and logical undo. This work extends Aries to work in
a logical setting. This idea is used to make Aries logging
more suitable for in-memory database systems. Systems like
[?] adopt this logical strategy.

If non-volatile RAM is available, database systems [17]
can use it to perform some optimizations at runtime to reduce
the log size by using shadow pages for updates. With
non-volatile RAM, recovery algorithms proposed by Lehman and
Carey [16] can then be applied.

There are many research efforts [4, 6, 25, 27, ?, 32]
devoted to efficient checkpointing for in-memory database
systems. Recent works such as [14, 32] focus on fast
checkpointing to support efficient recovery. Usually,
checkpointing techniques need to be combined with logging
techniques, and the two complement each other to realize a
reliable recovery process. Salem et al. [29] survey many
checkpointing techniques, which cover both inconsistent and
consistent checkpointing with different logging strategies.

(c) Overall throughput with 20% distributed transactions

Figure 16: Recovery time vs. node number with distributed
transactions

Johnson et al. [13] identify logging-related impediments
to database system scalability. The overhead of log-related
locking/latching contention decreases the performance of
database systems, since transactions need to hold locks
while waiting for the log to be written. Works such as
[13, 23, 24] try to make logging more efficient by reducing the
effects of locking contention.

RAMCloud [22], a key-value store for large-scale
applications, replicates each node's memory across nearby
disks. It is able to support very fast recovery by carefully
reconstructing the failed data from many other healthy machines.

6. Conclusion 

In the context of in-memory databases, command logging [19]
shows much better performance for transaction processing
compared to traditional write-ahead logging (Aries
logging [21]). However, the trade-off is that command logging
can significantly increase recovery times in the case of a
failure. The reason is that command logging redoes all
transactions in the log since the last checkpoint in a serial
order. To address this problem, we first extend command logging
to distributed systems to enable all the nodes to perform their
recovery in parallel. We identify the transactions involved in
the failed node by analyzing the dependency relations and only
redo those involved transactions to reduce the recovery
overhead. We find that the recovery bottleneck of command
logging is the synchronization process to resolve data
dependencies. Consequently, we design a novel adaptive logging
approach to achieve an optimized trade-off between the
performance of transaction processing and recovery. Our
experiments on H-Store show that adaptive logging can achieve
a 10x boost for recovery and that its transaction throughput
is comparable to that of command logging.


Figure 15: Overall performance evaluation 